Jakstab: A Static Analysis Platform for Binaries

Authors

  • Johannes Kinder
  • Helmut Veith
Abstract

For processing compiled code, model checkers require accurate model extraction from binaries. We present our fully configurable binary analysis platform JAKSTAB, which resolves indirect branches by multiple rounds of disassembly interleaved with dataflow analysis. We demonstrate that this iterative disassembling strategy achieves better results than the state-of-the-art tool IDA Pro.

Introduction. While most of today’s model checkers operate on source code, there are various settings where we need to verify binary code. First, when source code is not available, e.g., when a software manufacturer wants to verify the conformance of third party modules, such as drivers or plugins, to the API specification. Second, to be able to detect errors introduced in the compiling process [1], which is of particular importance in the field of embedded systems, where compilers can be unreliable. Third, binary level analysis results can supplement execution traces collected by testing and vice versa, as demonstrated by the SYNERGY algorithm [2]. And finally, our original motivation for this research stems from using model checking to detect malicious code inside executables [3].

Extracting a control flow graph (CFG) from an executable is not simply a matter of implementing a language front-end for assembly. Compiled code lacks many comfortable properties of structured high level languages and poses several challenges for analysis tools. Function pointers are only seldom handled by source-level verification tools, but on assembly level, calls and jumps to pointers are too abundant to be ignored. The treatment of function pointers requires dataflow analysis on an incomplete CFG. Thus, the traditional sequence, in which an analyzer builds the CFG first and only then performs dataflow analysis, has to be replaced by an iterative process. Another challenge is the loss of structure in compiled code.
For accurate analysis results, procedures, along with their calling conventions, need to be explicitly detected. Compiler optimizations and, worse, obfuscation techniques can further mangle the control flow structure of an executable and impede correct disassembly and control flow extraction [4].

Existing disassemblers can be divided into two categories [4]: Linear sweep disassemblers, such as GNU objdump, simply sequentially translate machine code into assembly instructions. Recursive traversal disassemblers, such as IDA Pro, follow direct branches and decode the program by depth first search. We extend this classification by defining an iterative disassembler as one that interleaves multiple disassembly rounds with dataflow analysis to achieve accurate and complete CFG extraction.

⋆ Supported by DFG grant FORTAS – Formal Timing Analysis Suite for Real Time Programs (VE 455/1-1) and the European Commission under Contract IST-2002-507932 ECRYPT.

Disassembly            Intermediate Representation
mov esi, [0x38498]         esi := mem32[0x38498];
jmp 0x1fae2                goto L2;
push [ebp - 4]         L1: mem32[esp − 4] := mem32[ebp − 4]; esp := esp − 4;
call [0x38588]             esp := esp − 4; mem32[esp] := 0x1FAE2; goto mem32[0x38588];
lea eax, [ebp - 4]     L2: eax := ebp − 4;
push eax                   mem32[esp − 4] := eax; esp := esp − 4;
push [ebp + 8]             mem32[esp − 4] := mem32[ebp + 8]; esp := esp − 4;
call esi                   esp := esp − 4; mem32[esp] := 0x1FAEB; goto esi;
cmp [ebp - 4], 0           tmp := mem32[ebp − 4] − 0; CF := tmp@31&(!mem32[ebp − 4]@31); OF := mem32[ebp − 4]@31&(!tmp@31); NF := tmp@31; if (tmp = 0) then ZF := 1 else ZF := 0;
jne 0x1fad9                if (ZF = 0) then goto L1;

Control Flow Graph (graph panel): basic blocks at 0x1faca, 0x1fae2, 0x1faeb, and 0x1fad9, plus two stub nodes for imported library functions.

Fig. 1. Part of procedure 0x1FACA in fwdrv.sys. The second call is not resolved by IDA Pro.

Our tool JAKSTAB³ (Java toolkit for static analysis of binaries) serves as a flexible front end to make executables accessible to static analysis and model checking. To this end, JAKSTAB contains an iterative disassembler and a library of semantic descriptions that translates assembly instructions to an RTL-style intermediate representation. Disassembler and semantic descriptions are fully configurable to support multiple target platforms. Using the intermediate representation, JAKSTAB iteratively creates the CFG, calculating and resolving indirect branch targets using results from dataflow analysis. JAKSTAB is implemented in Java and can be either used as a library or via its command line interface, which outputs plain disassembly or the intermediate representation as a CFG in graphviz-format. The intermediate representation, consisting of assignments, if, and goto statements, is independent of the target hardware and provides a natural interface to model checkers and program analysis tools.

Today’s de facto industry standard for disassembly is IDA Pro. Its heuristic matches common prologue bytes to identify procedures and assumes that every call returns to its original site, regardless of the call target, which can lead to erroneous fall-through edges.
Furthermore, the CFG is usually incomplete, since IDA Pro has only a very basic ability to resolve indirect branch instructions (function pointers): It propagates constants just within a basic block, and decorates calls to such constants with comments containing the actual target. While this is enough to aid human engineers, it is insufficient for automated analysis. Figure 1 shows an exemplary piece of assembly code from a Windows driver executable (fwdrv.sys from Sunbelt Personal Firewall), where IDA Pro (v4.7) fails to identify an indirect call to an imported function, whose address is stored at a memory location pointed to by the register esi. Finally, even though IDA Pro offers an (unsupported) SDK for plugin development, it is closed source software and thus cannot be easily integrated with an analysis tool.

³ Project page online at http://www.jakstab.org

To the best of our knowledge, the most successful approach to static analysis of executables currently is the CodeSurfer/x86 project [5]. CodeSurfer/x86 uses IDA Pro to access binaries, and combines two program analysis algorithms, value set analysis (VSA) and aggregate structure identification (ASI). In recent work, they combined VSA with a property automaton that encodes certain usage rules for the Windows driver API [6]. Generally, they assume a standard compilation model for binaries, which guarantees correct disassembly by IDA Pro. They acknowledge that IDA Pro’s output can be incomplete and do connect missing edges from indirect calls, yet they lack a complete loop to disassemble previously unprocessed branch targets.

Closely related to executable analysis is the idea of building a decompiler, which transforms an executable back to source code [7, 8]. Chang et al. describe an architecture of communicating decompilers at different language levels [9].
Their implementation propagates static analysis facts through all language levels one instruction at a time, instead of strictly separating decompilation stages by language level. The prototype targets assembly source files generated by a set of compilers, and thus requires access to source code. We believe that JAKSTAB would fit nicely into this tool-chain as a provider of well-formed CFGs from generic executables.

Control Flow Reconstruction. In most assembly languages, instructions can affect multiple registers and status flags. The x86 architecture, which we first focused on, features an especially rich instruction set where instructions often represent non-trivial operation sequences. To fully capture instruction semantics and enable easy extensibility, JAKSTAB is designed to read Semantic Specification Language (SSL) files supplied with the Boomerang decompiler, which are available for several architectures including x86, PowerPC, 68K, and SPARC [10, 8]. Figure 1 shows the intermediate representation JAKSTAB produces from the assembly snippet using SSL definitions for the x86 architecture.

Mapping every assembly instruction to its semantic specification creates a program representation with obvious pieces of dead code. In particular, most of the status flags are not used but simply overwritten by later instructions. To reduce the program size, our tool executes a live variable analysis and afterward removes any dead code. In our experiments, usually about 30% of the statements are identified as dead code and removed from the control flow graph. In the example in Figure 1, three flag updates are removed (crossed out text), and only one relevant update remains.

JAKSTAB recreates the control flow graph in an iterative process. Starting from the entry point of the executable, it propagates and folds constants through registers and memory cells to resolve indirect branch targets.
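The iterative process can be illustrated with a minimal sketch. It is not JAKSTAB's implementation: the instruction tuples, the addresses, and the names PROGRAM, MEMORY, and reconstruct_cfg are hypothetical, and the toy model ignores procedures, flags, and memory writes. Each round first disassembles along direct branches, then runs constant propagation on the CFG known so far, and finally schedules every newly resolved indirect target for the next round.

```python
# Toy model only: instruction tuples and addresses are hypothetical,
# not real x86 encodings or JAKSTAB's intermediate representation.
PROGRAM = {
    0: ("load", "esi", 100),  # esi := mem[100]
    1: ("jmp", 3),            # direct jump
    2: ("halt",),             # never reached
    3: ("jmp_reg", "esi"),    # indirect jump through esi
    7: ("mov", "eax", 1),     # reachable only via the indirect jump
    8: ("halt",),
}
MEMORY = {100: 7}  # constant data, e.g. a slot holding a code address


def direct_successors(addr, instr):
    """Successors that are known without any dataflow information."""
    op = instr[0]
    if op == "jmp":
        return [instr[1]]
    if op in ("halt", "jmp_reg"):
        return []
    return [addr + 1]


def transfer(instr, state, memory):
    """Apply one instruction to a register state (missing key = unknown)."""
    out = dict(state)
    if instr[0] == "mov":
        out[instr[1]] = instr[2]
    elif instr[0] == "load":
        if instr[2] in memory:
            out[instr[1]] = memory[instr[2]]
        else:
            out.pop(instr[1], None)
    return out


def join(old, new):
    """Keep only the registers on which both incoming states agree."""
    if old is None:
        return dict(new)
    return {reg: val for reg, val in old.items() if new.get(reg) == val}


def propagate_constants(program, memory, edges, entry):
    """Worklist constant propagation over the currently known edges."""
    in_state = {entry: {}}
    work = [entry]
    while work:
        addr = work.pop()
        out = transfer(program[addr], in_state[addr], memory)
        for src, dst in edges:
            if src != addr or dst not in program:
                continue
            merged = join(in_state.get(dst), out)
            if merged != in_state.get(dst):
                in_state[dst] = merged
                work.append(dst)
    return in_state


def reconstruct_cfg(program, memory, entry):
    known, edges, pending, rounds = set(), set(), [entry], 0
    while pending:
        rounds += 1
        # Disassembly round: recursive traversal along direct branches.
        while pending:
            addr = pending.pop()
            if addr in known or addr not in program:
                continue
            known.add(addr)
            for succ in direct_successors(addr, program[addr]):
                edges.add((addr, succ))
                pending.append(succ)
        # Dataflow round: recompute constants on everything known so far.
        states = propagate_constants(program, memory, edges, entry)
        # Resolved indirect targets are scheduled for the next round.
        for addr in known:
            instr = program[addr]
            if instr[0] == "jmp_reg":
                target = states.get(addr, {}).get(instr[1])
                if target is not None and (addr, target) not in edges:
                    edges.add((addr, target))
                    pending.append(target)
    return known, edges, rounds
```

Calling reconstruct_cfg(PROGRAM, MEMORY, 0) discovers address 7 only in the second round, after constant propagation has determined the value of esi. The sketch recomputes the dataflow facts from scratch in every round but never retracts an edge that was added earlier, which is a simplification.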
JAKSTAB supports indirect memory access, which is common for local variables stored on the stack. Whenever JAKSTAB cannot resolve the address of an indirect write, it currently assumes that every memory cell can become undefined. Calls to shared libraries, which, in the Windows PE-format, appear as indirect calls to memory locations, are handled by creating stub procedures in the control flow graph. Constant propagation and folding is performed on all parts of the CFG already known, which allows JAKSTAB, in contrast to IDA Pro, to successfully recover the CFG of the example in Figure 1. Note that the results of constant propagation can theoretically be incorrect if incoming edges to existing nodes are discovered in later iterations. In such cases, the CFG reconstruction process has to be restarted.

Any target location that has been successfully resolved in one iteration is scheduled for disassembly in the next one. Newly detected procedures are inlined to ensure correct interprocedural results in the next round of constant propagation. Figure 1 shows the CFG extracted from the example code, including stubs for imported library functions. The stubs non-deterministically assign those registers which might be overwritten by library functions (eax, ecx, edx according to the Intel application binary interface).

We compared JAKSTAB’s and IDA Pro’s capabilities in resolving indirect branches on Microsoft Windows system binaries. The results we present in Fig. 2 clearly show that JAKSTAB is able to provide significantly more accurate CFGs than IDA Pro at similar, and in some cases even faster, execution speeds.

                IDA Pro                  JAKSTAB
Executable      Success rate   Time      Success rate   Time
cmd.exe         74%            9.4s      91%            32.4s
dnsrslvr.dll    81%            36.2s     92%            3.2s
faultrep.dll    73%            5.4s      98%            9.0s
ftp.exe         88%            2.4s      94%            2.7s
nmnt.sys        74%            3.1s      96%            4.5s
rcp.exe         42%            1.4s      100%           1.1s
svchost.exe     56%            1.5s      88%            1.0s

Fig. 2. Success rates and processing times for resolving indirect branches in executables.

Applications and Future Work.
Our goal is to use JAKSTAB as a versatile platform for different verification tasks at the binary level. Currently, we are building a bounded model checker on top of the existing framework to allow better resolution of indirect jumps and the extraction of all targets from jump tables. Besides the internal use of the bounded model checker for improving the CFG, we will investigate what kind of specifications can be verified at the binary level, with particular focus on API usage specifications.

JAKSTAB, unlike IDA Pro, does not assume a standard compilation model. Therefore it is well suited to process code protected against disassembly, in particular malicious code. Anti-disassembly patterns that obscure the control flow of a program will thwart traditional recursive traversal disassemblers [4]. For example, return instructions are commonly misused as generic jumps by pushing the desired target address on the stack immediately beforehand. Since JAKSTAB supports local constant propagation through the stack, it can retarget disassembly correctly in these cases and is able to recover the real control flow. A CFG extracted from such a potentially malicious program can then be used as input to a semantic malware detector [3].
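The return-as-jump pattern mentioned above can be made concrete with a small sketch of constant propagation through an abstract stack within one basic block. It is an illustration under stated assumptions, not JAKSTAB's code: the instruction tuples and the function name resolve_return_target are hypothetical, and any instruction the toy model does not understand conservatively discards all known values.

```python
def resolve_return_target(block):
    """Return the constant address a trailing ret jumps to, or None.

    `block` is a list of hypothetical instruction tuples:
      ("mov", reg, const), ("push_const", const), ("push_reg", reg),
      ("pop", reg), ("ret",)
    """
    stack = []  # abstract stack of this block; None marks an unknown value
    regs = {}   # registers with a known constant value
    for instr in block:
        op = instr[0]
        if op == "mov":
            regs[instr[1]] = instr[2]
        elif op == "push_const":
            stack.append(instr[1])
        elif op == "push_reg":
            stack.append(regs.get(instr[1]))
        elif op == "pop":
            regs[instr[1]] = stack.pop() if stack else None
        elif op == "ret":
            # ret jumps to the value on top of the stack. If this block
            # pushed nothing, the target is the caller's return address,
            # which is unknown here.
            return stack.pop() if stack else None
        else:
            # Unknown instruction: forget everything (conservative).
            stack.clear()
            regs.clear()
    return None


# A ret misused as a jump: the target is pushed immediately beforehand.
obfuscated = [("mov", "eax", 0x401000), ("push_reg", "eax"), ("ret",)]
# An ordinary return: nothing was pushed inside the block.
ordinary = [("ret",)]
```

For the obfuscated block, resolve_return_target yields 0x401000, so disassembly can continue at that address instead of at the call site. For the ordinary block it yields None, and the return has to be resolved from the caller's context.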




Publication date: 2008